Cambridge Handbook of English Corpus Linguistics

Author

  • Paul Rayson
Abstract

The growing interest in corpus linguistics methods in the 1970s and 1980s was largely enabled by the increased power of computers and the use of computational methods to store and process language samples. Before this, even simple methods for studying language, such as extracting a list of all the different words in a text and their immediate contexts, were incredibly time-consuming and costly in terms of human effort. Only concordances of books of special importance such as the Qur’an, the Bible and the works of Shakespeare were made before the 20th century, and they required either a large number of scholars or monks or a significant investment of time by a single individual, in some cases more than ten years of their lives. In these days of web search engines and vast quantities of text available at our fingertips, the end user would be mildly annoyed if a concordance from a one-billion-word corpus took more than five seconds to be displayed. Other text-rich disciplines can trace their origins back to the same computing revolution. Digital Humanities scholars cite the work of Roberto Busa, who, working with IBM from 1949, produced his Index Thomisticus, a computer-generated concordance to the writings of Thomas Aquinas. Similarly, lexicographers in the 19th century used millions of handwritten cards or quotation slips, but the field was revolutionised in the 1980s with the creation of machine-readable corpora such as COBUILD and the use of computers for searching and finding patterns in the data. This chapter presents an introductory survey of computational tools and methods for corpus construction and analysis. The corpus research process involves three main stages: corpus compilation, annotation, and retrieval (see Rayson 2008). A corpus first needs to be compiled via transcription, scanning, or sampling from on-line sources.
Then, the second stage is annotation, through some combination of manual and automatic methods, to add tags, codes, and documentation that identify textual and linguistic characteristics. A snapshot of tools and methods that support the first and second stages of the corpus research process is described in Sections 2.1 and 2.2. Retrieval tools and methods enable the actual linguistic investigations based on corpora, i.e. frequency analysis, concordances, collocations, keywords and n-grams. These tools are introduced in Section 2.3, together with a brief timeline tracing the historical development of retrieval tools and methods and the current focus on web-based interfaces for mega-corpora. Corpus tools and methods are now being applied very widely to historical
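
As a rough illustration of three of the retrieval methods named in the abstract (frequency analysis, concordances, and n-grams), the following minimal Python sketch may help. It is not taken from the chapter: the function names (`tokenize`, `frequency_list`, `kwic`, `ngrams`), the naive regular-expression tokenizer, and the sample sentence are all assumptions made for illustration, and real corpus tools handle tokenization, annotation, and scale far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a deliberately naive tokenizer; real tools treat
    punctuation, hyphenation, and clitics with more care.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()


def kwic(tokens, node, window=3):
    """Return key-word-in-context lines as (left, node, right) tuples."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def ngrams(tokens, n=2):
    """Count contiguous n-grams (bigrams by default)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text, used only to exercise the functions.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

for left, node, right in kwic(tokens, "sat", window=2):
    # Right-align the left context so the node word lines up in a column.
    print(f"{left:>10} [{node}] {right}")
```

Aligning the node word in a central column, as the loop above does, is the conventional concordance layout; collocation and keyword analysis would additionally need association or keyness statistics, which this sketch leaves out.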

Similar resources

Introduction: Compiling and analysing the Spoken British National Corpus 2014

For over twenty years, the British National Corpus has been one of the most widely known and used corpora. It is almost impossible to attend an international corpus linguistics conference such as Corpus Linguistics, ICAME (International Computer Archive of Modern and Medieval English), AACL (American Association for Corpus Linguistics) or APCLC (Asia Pacific Corpus Linguistics Conference) witho...

Connecting NLP and Language Learning

  • Detmar Meurers (2012). Natural Language Processing and Language Learning. Encyclopedia of Applied Linguistics, edited by Carol A. Chapelle. Blackwell. 4193–4205.
  • Detmar Meurers (2015). Learner Corpora and Natural Language Processing. The Cambridge Handbook of Learner Corpus Research, edited by Sylviane Granger, Gaëtanelle Gilquin and Fanny Meunier. Cambridge University Press.
  • Luiz Amaral ...

Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms

In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...

Do We Need Discipline-Specific Academic Word Lists? Linguistics Academic Word List (LAWL)

This corpus-based study aimed at exploring the most frequently used academic words in linguistics and comparing the wordlist with the distribution of high-frequency words in Coxhead’s Academic Word List (AWL) and West’s General Service List (GSL) to examine their coverage within the linguistics corpus. To this end, a corpus of 700 linguistics research articles (LRAC), consisting of approximately ...

978-1-107-04119-6 - The Cambridge Handbook of Learner Corpus

Written and spoken data produced by learners has always been a key resource for the study of second language acquisition (SLA). However, for a long time the data used was rather artificial, i.e. resulting from highly controlled language tasks, and therefore not necessarily a reflection of what learners do in more natural communication contexts. In addition, the data samples were usually quit...

Grammatical Error Annotation for Korean Learners of Spoken English

The goal of our research is to build a grammatical error-tagged corpus for Korean learners of Spoken English dubbed Postech Learner Corpus. We collected raw story-telling speech from Korean university students. Transcription and annotation using the Cambridge Learner Corpus tagset were performed by six Korean annotators fluent in English. For the annotation of the corpus, we developed an annota...

Publication date: 2013